We have fitted VGLMs and VGLMMs at this point in the workshop. Hopefully, it is now clear to you that when we fit these models we make assumptions about the data generating process. If these assumptions do not hold, there are consequences for the conclusions that we connect to the results (they are most likely wrong). So, we need to understand the assumptions that we make, and more importantly, we need to know how to check and address assumption violations.
With assumptions, we can distinguish between statistical assumptions and ecological assumptions. Here, we will first focus on the former, but the latter is not less important. Statistical assumptions include those listed for VGLMs and VGLMMs in the previous practicals and in the lectures:
With ecological assumptions I mean the hypotheses and understandings we convey when we formulate a model. We validate and check those with things like model comparison (information cirteria, likelihood ratio test) and predictive comparison (more on that at the end of the day).
As for data, you are free to pick a dataset again, but I suggest you take one that has some flexibility or ambiguity as to the right response distribution. For biomass you could work with Tweedie or gamma, for count data it is usually Poisson or negative-binomial, for binary data the response distribution is binomial, but there is choice in the link function: it can be logit or probit. Every data type has a natural response distribution connected to it, usually based on its domain (i.e., the limits of the data; biomass cannot be zero, counts cannot be negative and binary cannot be different from 0 or 1).
I will again start with the waddensea (abundance) data, because it allows me to demonstrate the consequences of not accommodating overdispersion, and how to check for overdispersion.
Y <- read.table("../../data/waddenY.csv", sep="," ,header=TRUE, row.names = 2)[,-1]
X <- read.table("../../data/waddenX.csv", sep=",", header=TRUE, row.names = 2)[,-1]
X <- X[,-which(apply(X,2, anyNA))] # remove column with NAs
The natural point of departure for count data is the Poisson distribution, or the binomial distribution if you want to condition on the total count.
Let’s fit a few of the models from the previous exercises, so we can examine them more closely:
library(gllvm)
model1 <- gllvm(Y, X, formula = ~scale(silt_clay) + scale(elevation), family = "poisson", num.lv = 0)
model2 <- gllvm(Y, X, formula = ~scale(silt_clay) + scale(elevation), family = "negative.binomial", num.lv = 0)
model3 <- gllvm(Y, X, formula = ~(0+scale(silt_clay)|1) + (0+scale(elevation)|1), family = "poisson", num.lv = 0)
model4 <- gllvm(Y, X, formula = ~(0+scale(silt_clay)|1) + (0+scale(elevation)|1), family = "negative.binomial", num.lv = 0)
Similar to a linear regression, we can use plot function
to visualize the residuals from a gllvm type object. There is
also a residuals function if you actually want to have the
residuals, but that is rarely needed in practice. This plot
function makes five plots:
Here the residual is defined as the Dunn-Smyth residual, also referred to as the randomized quantile residual. It is the gold standard of residuals when it comes to complex statistical models, is straightforward to define for all our statistical models, and is exactly normally distributed even in small samples, regardless of the response distribution that we select. It has a random component to is, which means that your residual will not look exactly the same every time you use the function, but there will only be minor differences that should not affect your conclusion.
Residual diagnostics is pretty straightforward: there should be no odd looking patterns in the plots. If there are, the conclusions we connect to it, and the consequences it has, depend on the exact assumption that is violated. Safe to say, if all assuptions are met, we do not need to concern ourselves further with the details, so let’s focus on that instead!
We check the following assumptions with the following plots:
There is some grey area here; assumptions violations can be extreme and easy to spot, or subtle and difficult to conclude. If you are afraid an assumption is violated, the safe thing to do is to relax the assumption to the best of your ability by adjusting the model, and seeing if the results have changed. If they haven’t, the assumption was not violated or it simply didn’t matter for your results.
Each of the aforementioned assumptions tends to result in particular patterns in the plots; most violations will show in the first plot, but particularly systematic depature of the model. If you select the wrong distribution the QQ-plot will show departure from the diagonal line. If there is dependence of observations; usually spatially or temporally of sites, this will show in the third plot as clustering. Similarly, similarity of species (e.g., due to relatedness) will show as clustering in the fourth plot. Outliers will show in all the plots; particularly in the fifth plot as points that lay high above the line.
In our case, if we look at the QQ-plot of the first model:
plot(model1, which = 2)
We see that in the tails of the distribution there is a lot of deviation; the absolute size of the residuals is much larger than what corresponds to the Poisson distribution, so that the distributional assumption is violated. Also from some of the other plots it is not hard to determine that the model is a poor fit:
plot(model1, which = c(1, 3, 4, 5))
The residuals vs. linear predictors (fitted) shows a fan pattern that is indicative of overdispersion, residuals vs. rows shows some very extreme residuals for particular species, which we also see in the residuals vs. columns plot. The scale-location plot leads to the same conclusion: the variance function is not correct. If you work with count data in community ecology, this would quickly lead you to the conclusion that a negative-binomial model will fit better, which we can verify by plotting the residuals of the second model.
plot(model2, which = c(1,2))
The QQ-plot shows no departure and the residuals vs. linear predictors plot shows no extreme outliers or patterns. That does not make it a good model, just one with valid assumptions!
Those five plots and assumptions cover assumption checking over the overall model residuals. We have not yet checked the normality assumption of the estimated random effects. It is something that is very often overlooked in mixed-effects models. There is a difference in the prior assumption for the distribution of the random effect, and the posterior (in Bayesian language by lack of a more succint frequentist alternative). Technically, in frequentist statistics, we distinguish between the distribution of the random effect before the model has encountered the data, and after. The former comes in as the normality assumption of the random effect, which you may have heard other people speak of. The latter is by design much more flexible, and its exact form and implementation vary depending on how we exactly estimate the model.
Here, we can check the prior assumption of the random effect using the random effects estimates. It is important to note that these assumptions are similar to checking assumptions of a linear regression using the residual. There, the residual is the estimate of the error in the linear regression, as here our estimates are for the prior random effects. The assumptions are also very similar; for the random effects we assume homoscedasticity, independence to some degree, no outliers, and normality. We are slightly more limited in checking those, as we do not have “data” observed for the random effects; they are missing or unobserved and we only have the estimates. We can, however, plot them in a histogram, QQ-plot, and against the species to assess our assumptions. Here we go:
re <- coef(model3, "Br")
par(mfrow=c(1,nrow(re)))
# qqplot
for(i in 1:nrow(re)){
qqnorm(re[i,], main = row.names(re)[i])
qqline(re[i,])
}
# histogram
for(i in 1:nrow(re)){
hist(re[i,], main = row.names(re)[i], xlab = "Random effect")
}
# RE vs species
for(i in 1:nrow(re)){
plot(re[i,], x = 1:ncol(model3$y), main = row.names(re)[i], xlab = "Species", ylab = "Random effect")
}
We can clearly see some devation from normality in the random effects for elevation. This probably has to due with the misspecification of the response distribution, which the model is trying to accommodate. Let’s check again for the negative-binomial VGLMM:
re <- coef(model4, "Br")
par(mfrow=c(1,nrow(re)))
# qqplot
for(i in 1:nrow(re)){
qqnorm(re[i,], main = row.names(re)[i])
qqline(re[i,])
}
That looks somewhat better. In practice, there is very little we can do
about this. Deviation from normality indicates that the tails of the
estimated distribution are heavier than expected by normality, and
particularly for the previous QQ-plot, that the estimated distribution
has strong skew. In practice, it usually means that the model we have is
not particularly good; we can add more covariates, or change its
structure, to make things nicer behaved. Changing the random effects
distribution is very complex, and little software to do so is available,
so if changing the model does not work, a full Bayesian analysis would
probably be the next step (but trust me, changing the model will work!).
The consequences of the violation are not (too) dire; it mostly means
that the estimate for the variance of the random effect can be biased,
and the prediction intervals of the random effects might be inaccurate.
The direction of the bias (upward or downward) will depend on the exact
nature of the assumption violation.
Here is what I want you to do:
Residual diagnostics are a very important part to model checking, why would we compare models if one of them is not a valid model in the first place?
However, there are different ways to approach your final workflow; we could first perform model selection and cross our fingers that the final model has all its assumptions met, so that we don’t have to reiterate our process. Ultimately, model fitting, validation, and comparison is a process with an iterative nature. We fit models, we compare or check them, and we refine.
Anyway, in this second part of the practical I would like us to compare some models. This can be fixed effects models as you fitted yesterday, or random effects models, but keep in mind the limitations of these methods when it comes to random effects, so that exhaustive model comparison should generally be avoided. Likelihood ratio tests should not be use to compare models when the null is “on the boundary”. In practice, this means that you should not use it to test if including a new random effects improves your model. The approximation for the likelihood ratio test implemented in gllvm also assumes that the log-likelihood is quadratic. In practice, this means that it is relatively safe to test for if including new fixed effects improves your model, although there are some theoretical underpinnings that say it might sometimes go wrong. Information criteria (as in AIC) rely on the same assumptions.
Exhaustive model comparison will only lead us into trouble due to Freedman’s paradox (by chance, some variables will associate to the noise in our response data, and thus sometimes show as statistically significant effects). Make sure not to combine model selection with information criteria, and hypothesis testing, in practice as these are two completely different paradigms that should not be mixed (you will end up p-hacking; information criteria naturally selects models with statistical significance as the concepts are related).
The functions we can use for model comparison in gllvm
are:
summary (wald-statistics)AIC, BIC or AICc for
information criteria (all with a slightly different angle, but similar
in interpretation)anova for (approximate) likelihood ratio test of
modelsThe anova method has as main limitations that 1) the
model need to be nested, 2) the number of parameters difference can not
be too large, the null hypothesis cannot be on the boundary of the
parameter space (such as a zero variance estimate because you are
omitting species-specific random effects). Especially regarding 2) the
function will throw a (relative conservative) warning. For example, we
may want to compare the first model to a model without elevation:
model5 <- gllvm(Y, X, formula = ~scale(silt_clay), family = "poisson", num.lv = 0)
anova(model1, model5)
## Warning in anova.gllvm(model1, model5): This test was not designed for tests with a df.diff larger than 20 so the P-value should be treated as approximate.
## Model 1 : ~ scale(silt_clay)
## Model 2 : ~ scale(silt_clay) + scale(elevation)
## Resid.Df D Df.diff P.value
## 1 9280 0.00 0
## 2 9222 10717.18 58 0
The warning is not because the test does not work. This has to do with the fact that when we include additional species-specific fixed effects, there is a large number of parameters added to the model. The difference in the number of parameters for a model with and without a single covariate is given by the number of species, and we often have many. The likelihood improves with every parameter that we add, so a single covariate difference often means a considerable change in the likelihood.
A safer comparison, could be for the random-effects model with and without the species-common effect, for example:
model6 <- gllvm(Y, X = X, formula = ~(0+scale(silt_clay)|1) + (0+scale(elevation)|1),
studyDesign = X, row.eff = ~scale(silt_clay), family = "poisson", num.lv = 0)
has one parameter less than the third model; the species-common
effect for elevation is omitted by “overruling” the default of included
species-common effects for all random effects, by using the
row.eff formula argument. Now, our hypothesis test only
looks at the improvement in the model due to a single parameter: the
average effect for elevation. So, the question we are asking is “are
species responses to elevation on average 0” (or, “does including an
average effect of elevation improve our model”):
anova(model3, model6)
## Model 1 : y ~ NULL
## Model 2 : y ~ NULL
## Resid.Df D Df.diff P.value
## 1 9335 0.000000000 0
## 2 9334 0.008250123 1 0.927628
The answer is no, no it does not; the p-value is considerably above
the 0.05 threshold. AIC gives us essentially the same
answer:
AIC(model3, model6)
## df AIC
## model3 62 71689.83
## model6 61 71687.84
the model with an additional parameter is barely 2 points better, so there is no noticeable improvement. This is not a particularly big surprise, as the species-common effect for elevation was not statistically significant:
summary(model3)
##
## Call:
## gllvm(y = Y, X = X, formula = ~(0 + scale(silt_clay) | 1) + (0 +
## scale(elevation) | 1), family = "poisson", num.lv = 0)
##
## Family: poisson
##
## AIC: 71689.83 AICc: 71690.67 BIC: 72133.01 LL: -35783 df: 62
##
## Informed LVs: 0
## Constrained LVs: 0
## Unconstrained LVs: 0
##
## Formula: ~(0 + scale(silt_clay) | 1) + (0 + scale(elevation) | 1)
## LV formula: ~ 0
## Row effect: ~ 1
##
## Random effects:
## Name Variance Std.Dev
## scale.silt_clay. 0.5981 0.7734
## scale.elevation. 1.0211 1.0105
##
## Coefficients predictors:
## Estimate Std. Error z value Pr(>|z|)
## scale.silt_clay. -0.02381 0.09576 -0.249 0.804
## scale.elevation. 0.01332 0.14672 0.091 0.928
Although we have to separate out some of these different concepts of hypothesis testing and model selection with information criteria, I am wildly mixing them here for demonstration that they are related paradigms that often give us similar answers anyway.
anova if your analysis is confirmatory, or
information criteria if your analysis is exploratory, to determine which
covariate(s) in the data drive your community.